#5 Basic NLP with the Command-line
Faculty of Humanities and Social Sciences
University of Lucerne
21 March 2024
Historical development of Swiss party politics (Tagesanzeiger)
.txt)
.csv, .tsv, .xml)
.tsv file
absolute frequency = n_occurrences
relative frequency = n_occurrences / n_total_words
Print the following sentence in your command line using echo.
How many words are in this sentence? Use the pipe operator | to pass the output above to the command wc.
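A minimal sketch of the two steps above. The sentence here is only a placeholder (an assumption), so swap in the sentence from the exercise. The variable names sentence and n_words are made up for illustration.

```shell
# Placeholder sentence (assumption): replace it with the one from the exercise.
sentence="Computational methods make computational text analysis fun"

# Print the sentence.
echo "$sentence"

# Pass the output to wc via the pipe; -w counts words.
# tr -d ' ' removes the padding that some wc versions add.
n_words=$(echo "$sentence" | wc -w | tr -d ' ')
echo "$n_words"
```

Without -w, wc reports lines, words, and bytes all at once.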
Match the word computational and colorize its occurrences in the sentence using egrep.
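One possible way to do this, again with a placeholder sentence (an assumption). The sketch uses grep -E, which behaves like egrep; recent GNU versions print a warning for the egrep spelling.

```shell
# Placeholder sentence (assumption): replace it with the one from the exercise.
sentence="Computational methods make computational text analysis fun"

# --color highlights each match when the output goes to a terminal.
echo "$sentence" | grep -E --color 'computational'

# Exact matching is case-sensitive: only the lowercase occurrence counts.
n_exact=$(echo "$sentence" | grep -Eo 'computational' | wc -l | tr -d ' ')

# -i ignores case, so the capitalised occurrence counts as well.
n_ignore_case=$(echo "$sentence" | grep -Eio 'computational' | wc -l | tr -d ' ')

echo "$n_exact $n_ignore_case"
```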
Get the frequencies of each word in this sentence using tr and other commands.
Save the frequencies into a TSV file, open it in a spreadsheet program (e.g., Excel, Numbers), and compute the relative frequency per word.
Are there some words that, although different, should be considered as the same?
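The exercise above could be approached roughly as follows. This is a sketch under assumptions: the sentence is a placeholder, the file name word_frequencies.tsv is hypothetical, and the relative frequency is computed with awk here instead of a spreadsheet. Lowercasing is one simple answer to the last question, since it merges Computational and computational.

```shell
# Placeholder sentence and hypothetical output file name (assumptions).
sentence="Computational methods make computational text analysis fun"
out="word_frequencies.tsv"
tab=$(printf '\t')

# Lowercase, put one word per line, count identical lines,
# reorder to "word<TAB>count", and sort by descending count.
echo "$sentence" \
  | tr '[:upper:]' '[:lower:]' \
  | tr -s ' ' '\n' \
  | sort | uniq -c \
  | awk '{print $2 "\t" $1}' \
  | sort -t "$tab" -k2,2nr -k1,1 > "$out"

# Relative frequency = n_occurrences / n_total_words, added as a third column.
rel=$(awk -F'\t' '
  { word[NR] = $1; count[NR] = $2; total += $2 }
  END { for (i = 1; i <= NR; i++)
          printf "%s\t%d\t%.3f\n", word[i], count[i], count[i] / total }
' "$out")
printf '%s\n' "$rel"
```

The TSV file opens directly in a spreadsheet program, where the same division can be done in a formula.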
So far, we have searched for exact matches only. But …
How to find all words starting with the letter A?
… specific parts in texts
\w represents all alphanumeric characters
# count political areas by looking up words ending with "politik"
egrep -rioh "\w*politik" **/*.txt | sort | uniq -c | sort -h
# count ideologies/concepts by looking up words ending with "ismus"
egrep -rioh "\w*ismus" **/*.txt | sort | uniq -c | sort -h
# arguments:
# -r search files recursively
# -i ignore case
# -o output only the match and not the entire line
# -h do not print file names
X times
* zero or any number
? zero or one
+ one or more
{n}, {min,max} a specified number of times
⚠️ Do not confuse regex with Bash wildcards!
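To illustrate the warning, here is a small sketch that applies the same pattern ab* once as a regex and once as a Bash wildcard. The four test strings are made up for illustration.

```shell
# As a regex, * repeats the preceding character: "a" followed by zero or more "b".
# -x requires the whole line to match.
regex_hits=$(printf '%s\n' a ab abbb abc | grep -Ex 'ab*' | xargs)

# As a Bash wildcard, * stands for any string: "ab" followed by anything.
glob_hits=""
for w in a ab abbb abc; do
  case "$w" in
    ab*) glob_hits="${glob_hits:+$glob_hits }$w" ;;
  esac
done

echo "regex ab*: $regex_hits"     # a ab abbb
echo "wildcard ab*: $glob_hits"   # ab abbb abc
```

The regex accepts a (zero b's) but rejects abc, while the wildcard does the opposite.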
[...] any of the characters between brackets
[auoei] [0-9] [A-Z] [a-z]
. matches any character (excl. newline)
\ escapes to match literal
\. means the literal . instead of “any symbol”
\w matches any alphanumeric character
[A-Za-z0-9_]
\s matches any whitespace (space, newline, tab)
[ \t\n]
.* 💪 matches any character any number of times
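The building blocks above can be tried out on a short string. The sample text and the variable names are made up for illustration; \b (word boundary) is not listed above and, like \w, is an extension that GNU grep supports.

```shell
# Made-up sample text (assumption).
text="Anna and Albert paid 25 francs in 2024. Alle Achtung!"

# Words starting with the letter A: word boundary, A, then any word characters.
words_with_a=$(echo "$text" | grep -Eo '\bA\w*' | xargs)

# + means one or more digits; {4} means exactly four digits.
numbers=$(echo "$text" | grep -Eo '[0-9]+' | xargs)
years=$(echo "$text" | grep -Eo '[0-9]{4}' | xargs)

# \. matches a literal dot, not "any character".
before_dot=$(echo "$text" | grep -Eo '\w+\.' | xargs)

# \w* in front of a fixed ending finds all words with that ending.
politik=$(echo "Klimapolitik und Sozialpolitik" | grep -Eio '\w*politik' | xargs)

echo "$words_with_a | $numbers | $years | $before_dot | $politik"
```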
Go to the website https://www.swissinfo.ch/ger and copy a few paragraphs of any article.
After that, go to the website https://regex101.com/ and paste the text into the big white field.
Write various regex patterns in the small field to match
TODO
Come up with your own challenge
When you look for useful primers on Bash, consider the following resources:
git pull. If you haven’t cloned the repository yet, follow section 5 of the installation guide.
ked2024/materials/data/swiss_party_programmes/txt. Change into that directory using cd.
more.
Compare the absolute frequencies of single terms or multi-word phrases of your choice (e.g., Ökologie, Sicherheit, Schweiz)…
Use the file names as a filter to get various aggregations of the word counts.
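A sketch of how wildcards on file names can aggregate counts. The file naming scheme (party_year.txt), the demo texts, and the helper count_term are all assumptions made for illustration; check the actual file names in the corpus directory before adapting the patterns.

```shell
# Tiny stand-in corpus with a hypothetical naming scheme party_year.txt.
demo=$(mktemp -d)
printf 'Sicherheit fuer die Schweiz. Die Schweiz bleibt neutral.\n' > "$demo/partyA_2015.txt"
printf 'Mehr Sicherheit.\n' > "$demo/partyA_2019.txt"
printf 'Die Schweiz braucht Oekologie.\n' > "$demo/partyB_2019.txt"

# Hypothetical helper: absolute frequency of TERM in the given files.
# usage: count_term TERM FILE...
count_term() {
  term=$1
  shift
  grep -Eioh "$term" "$@" | wc -l | tr -d ' '
}

count_term 'schweiz' "$demo"/partyA_*.txt   # all years of one party
count_term 'schweiz' "$demo"/*_2019.txt     # all parties in one year
count_term 'sicherheit' "$demo"/*.txt       # entire corpus
```

The wildcard decides which files grep sees, so changing it changes the level of aggregation without touching the counting logic.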
Pick terms of your interest and look at their contextual use by extracting relevant passages. Does the usage differ across parties or time?
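One way to extract passages is a keyword-in-context search: match the term together with a few characters on either side. The helper kwic, the demo texts, and the file names below are assumptions for illustration only.

```shell
# Tiny stand-in corpus (made-up texts and file names).
kwic_dir=$(mktemp -d)
printf 'Wir wollen mehr Sicherheit fuer alle Menschen im Land.\n' > "$kwic_dir/partyA_2019.txt"
printf 'Soziale Sicherheit ist ein Grundrecht.\n' > "$kwic_dir/partyB_2019.txt"

# Hypothetical helper: print TERM with up to WIDTH characters of context
# on each side; -H prefixes every hit with its file name.
# usage: kwic TERM WIDTH FILE...
kwic() {
  term=$1
  width=$2
  shift 2
  grep -EioH ".{0,$width}$term.{0,$width}" "$@"
}

kwic 'sicherheit' 15 "$kwic_dir"/*.txt
```

Because each hit carries its file name, the passages can be grouped by party or year with the same file name filters as before.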